Abstract

Background
Metagenomic next-generation sequencing (mNGS) has emerged as a powerful tool for the unbiased detection of pathogens during outbreak investigations. However, its effectiveness can be limited by the high abundance of host-derived nucleic acid (NA), which reduces pathogen detection sensitivity. Long-read sequencing technologies offer key advantages in genome assembly and pathogen resolution, but their performance depends on sample type and enrichment protocol. In response to an outbreak of an unknown febrile illness in Panzi, Democratic Republic of the Congo (DRC), in late 2024, mNGS was used as a tool for laboratory investigation. Here, we evaluated and compared three long-read mNGS protocols used during the investigation: standard SISPA using random nonamers, and eSISPA using a set of 96 non-ribosomal primers, with and without PtCl4 sample treatment. We evaluated sequencing quality, host depletion efficiency, and taxonomic profiling using the CZID web tool. We applied a background model with a z-score > 1 threshold to filter taxa and focus on clinically relevant pathogens.

Results
We re-sequenced 15 samples (8 blood samples and 7 oral swabs). Compared with standard SISPA, eSISPA-based protocols showed higher sequencing yield, enhanced host NA depletion, and improved detection sensitivity for diverse pathogens. Taxonomic profiling showed protocol-dependent differences in microbial diversity and abundance, with PtCl4 eSISPA and standard SISPA detecting the highest number of unique taxa in blood and swab samples, respectively. PtCl4 eSISPA detected bacteria identified by qPCR (Klebsiella in 5 samples), eSISPA detected Salmonella and Klebsiella in 3 samples each, and standard SISPA detected Plasmodium in 6 samples. However, eSISPA-based protocols failed to recover Plasmodium reads in 5 samples despite positive diagnostics, likely due to sample degradation, as the experiments were not performed in parallel. A Venn diagram highlighted both shared and unique taxa across protocols, underscoring the impact of protocol selection on metagenomic outcomes.

Conclusions

Our study reveals protocol-dependent differences in sequencing yield and host nucleic acid depletion, especially in swab samples. In blood, trade-offs affect consistency, host depletion, and pathogen sensitivity. No single method proved universally optimal, highlighting the need for parallel testing with high-quality, pathogen-positive samples. Nevertheless, eSISPA-based protocols are promising for outbreak investigations because they deplete host NA and improve pathogen detection, though further validation is required.

Keywords:

Metagenomics, long-read sequencing, SISPA, eSISPA, pathogen detection, host depletion, outbreak investigation, Democratic Republic of the Congo, Oxford Nanopore Technologies

Data analysis

Install packages and load libraries

if (!require(pacman)) install.packages("pacman")
pacman::p_load(
  broom, flexdashboard, drc, esquisse, flextable, ggplot2, ggpubr,
  ggsignif, gridExtra, gt, gtsummary, here, inspectdf,
  janitor, officer, patchwork, purrr, rstudioapi, scales,
  stringr, tidyverse, vegan, webshot2, rlang
)

Set path for working directory

WD <- here()
output_dir <- here("outputs")

Load data

# Datasets for raw sequencing data QC

read_QC           <- read_csv(here("data", "data_fig_1_QC_NanoPlot.csv"))
panzi_raw_data    <- read_csv(here("data", "data_fig_2_table_1_raw_seqkit_stat.csv"))
minimap_host      <- read_csv(here("data", "data_fig_3a_3b_minimap_report_summary_2.csv"))

# CZID datasets for the distribution of normalized reads

esispa_blood_hist     <- read_csv(here("data", "data_fig_4a_czid_esispa_blood_all.csv"))
esispa_swab_hist      <- read_csv(here("data", "data_fig_4b_czid_esispa_swab_all.csv"))
std_sispa_blood_hist  <- read_csv(here("data", "data_fig_4c_czid_sispa_blood_all.csv"))
std_sispa_swab_hist   <- read_csv(here("data", "data_fig_4d_czid_sispa_swab_all.csv"))

# Z-score datasets for the distribution of normalized reads

z_score_esispa_blood_hist <- read_csv(here("data", "data_fig_5a_czid_zscore_esispa_blood_all.csv"))
z_score_esispa_swab_hist  <- read_csv(here("data", "data_fig_5b_czid_zscore_esispa_swab_all.csv"))
z_score_sispa_blood_hist  <- read_csv(here("data", "data_fig_5c_czid_zscore_sispa_blood_all.csv"))
z_score_sispa_swab_hist   <- read_csv(here("data", "data_fig_5d_czid_zscore_sispa_swab_all.csv"))

Data analysis and figures design

Fig.1(a-b) : Quality control of raw sequencing data using NanoPlot and NanoStat

Table 1-2 : Summary Tables of Raw SeqKit Statistics for Panzi Samples

The output file is saved as a Word document in the “outputs” folder
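
One way such a table could be written to a Word document is sketched below with flextable. The data frame, its column names and values, and the file name `table_1_seqkit_summary.docx` are illustrative assumptions rather than the study's actual table, and a temporary folder stands in for the “outputs” folder.

```r
library(flextable)

# Toy stand-in for a per-sample SeqKit summary (all values are made up)
seqkit_summary <- data.frame(
  sample   = c("S01", "S02"),
  protocol = c("eSISPA", "Standard SISPA"),
  num_seqs = c(12000, 8500),
  sum_len  = c(9.6e6, 5.1e6)
)

# Temporary folder used here in place of here("outputs")
output_dir <- file.path(tempdir(), "outputs")
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)

# Build the table, fit the column widths, and write it to a .docx file
ft <- flextable(seqkit_summary)
ft <- autofit(ft)
docx_path <- file.path(output_dir, "table_1_seqkit_summary.docx")
save_as_docx(ft, path = docx_path)
```

Creating the folder before writing avoids a failure on a fresh clone where “outputs” does not exist yet.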

Fig.2(a-b) : Comparison of total bases (a) and total reads (b) generated per sample using SeqKit statistics output
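
A per-sample comparison of this kind could be drawn as a grouped bar chart. The sketch below uses made-up values, and the column names (`sample`, `protocol`, `num_seqs`, `sum_len`) are assumed to mirror typical `seqkit stats` output rather than taken from the actual CSV file.

```r
library(ggplot2)

# Toy stand-in for the SeqKit statistics (all values are made up)
seqkit_stats <- data.frame(
  sample   = rep(c("S01", "S02", "S03"), times = 2),
  protocol = rep(c("eSISPA", "Standard SISPA"), each = 3),
  num_seqs = c(12000, 15000, 9000, 8500, 7000, 6400),
  sum_len  = c(9.6e6, 1.2e7, 7.0e6, 5.1e6, 4.4e6, 3.9e6)
)

# Panel (a): total bases per sample, with one bar per protocol side by side
fig_total_bases <- ggplot(seqkit_stats,
                          aes(x = sample, y = sum_len, fill = protocol)) +
  geom_col(position = position_dodge()) +
  labs(x = "Sample", y = "Total bases", fill = "Protocol") +
  theme_bw()

# Panel (b): same layout, with total reads on the y axis
fig_total_reads <- ggplot(seqkit_stats,
                          aes(x = sample, y = num_seqs, fill = protocol)) +
  geom_col(position = position_dodge()) +
  labs(x = "Sample", y = "Total reads", fill = "Protocol") +
  theme_bw()
```

The two panels can then be arranged side by side, for example with patchwork (`fig_total_bases + fig_total_reads`).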

Fig.3(a-b) : Distribution of host and non-host reads per protocol and sample
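
The host and non-host split can be expressed as a percentage of each library, which makes protocols with different yields comparable. The following is a minimal sketch on made-up counts. The column names `total_reads` and `host_reads` are hypothetical and may not match the minimap report summary file.

```r
# Toy stand-in for the host-mapping summary (all values are made up)
host_map <- data.frame(
  sample      = c("S01", "S01", "S02", "S02"),
  protocol    = c("eSISPA", "Standard SISPA", "eSISPA", "Standard SISPA"),
  total_reads = c(10000, 8000, 20000, 5000),
  host_reads  = c(2500, 6000, 4000, 4500)
)

# Non-host reads are the reads that did not map to the host reference
host_map$non_host_reads <- host_map$total_reads - host_map$host_reads

# Percentage of each library made up of host and non-host reads
host_map$host_pct     <- 100 * host_map$host_reads / host_map$total_reads
host_map$non_host_pct <- 100 - host_map$host_pct

print(host_map)
```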

Fig.4-5(a-d) : Distribution of normalized reads per million based on the Z-score background model distribution

Fig.4(a-d) : Distribution of reads per million without filtering the background noise
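
For orientation, reads per million (rPM) rescales a taxon's read count by the sequencing depth of its sample. The sketch below assumes the usual definition, taxon reads divided by total reads and multiplied by one million. The exported CZID tables already carry normalized values, so this only illustrates the arithmetic on made-up counts with hypothetical column names.

```r
# Toy taxon counts (all values are made up)
taxon_counts <- data.frame(
  sample      = c("S01", "S01", "S02"),
  taxon       = c("Klebsiella", "Salmonella", "Plasmodium"),
  taxon_reads = c(50, 5, 120),
  total_reads = c(100000, 100000, 400000)
)

# rPM = reads assigned to the taxon / total reads in the sample * 1e6
taxon_counts$rpm <- taxon_counts$taxon_reads / taxon_counts$total_reads * 1e6

print(taxon_counts)
```

Note that the 120 Plasmodium reads in the deeper S02 library give a lower rPM than the 50 Klebsiella reads in S01, which is the effect the normalization is meant to capture.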

Fig.5(a-d) : Distribution of z-scores and selection of the filtering threshold against background noise
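
The z-score > 1 filter described in the abstract can be sketched as follows, assuming the z-score is computed as the sample rPM minus the background mean, divided by the background standard deviation. The taxa, values, and column names (`bg_mean`, `bg_sd`) are made up for illustration. In practice the z-scores come from the CZID background model export.

```r
# Toy per-taxon values for one sample (all values are made up)
taxa <- data.frame(
  taxon   = c("Klebsiella", "Salmonella", "Cutibacterium"),
  rpm     = c(500, 50, 40),
  bg_mean = c(20, 10, 35),   # mean rPM of the taxon in background samples
  bg_sd   = c(40, 10, 25)    # standard deviation of rPM in background samples
)

# z-score: how far the sample rPM sits above the background distribution
taxa$z_score <- (taxa$rpm - taxa$bg_mean) / taxa$bg_sd

# Keep only taxa above the threshold stated in the abstract (z-score > 1)
z_threshold <- 1
taxa_kept <- taxa[taxa$z_score > z_threshold, ]

print(taxa_kept)
```

In this toy example the third taxon has a z-score of 0.2 and is dropped as background, while the other two are retained.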